Imputation of missing values for cochlear implant candidate audiometric data and potential applications

Objective Assess the real-world performance of popular imputation algorithms on cochlear implant (CI) candidate audiometric data. Methods 7,451 audiograms from patients undergoing CI candidacy evaluation were pooled from 32 institutions with complete case analysis yielding 1,304 audiograms. Imputation model performance was assessed with nested cross-validation on randomly generated sparse datasets with various amounts of missing data, distributions of sparsity, and dataset sizes. A threshold for safe imputation was defined as root mean square error (RMSE) <10dB. Models included univariate imputation, interpolation, multiple imputation by chained equations (MICE), k-nearest neighbors, gradient boosted trees, and neural networks. Results Greater quantities of missing data were associated with worse performance. Sparsity in audiometric data is not uniformly distributed, as inter-octave frequencies are less commonly tested. With 3–8 missing features per instance, a real-world sparsity distribution was associated with significantly better performance compared to other sparsity distributions (Δ RMSE 0.3 dB– 5.8 dB, non-overlapping 99% confidence intervals). With a real-world sparsity distribution, models were able to safely impute up to 6 missing datapoints in an 11-frequency audiogram. MICE consistently outperformed other models across all metrics and sparsity distributions (p < 0.01, Wilcoxon rank sum test). With sparsity capped at 6 missing features per audiogram but otherwise equivalent to the raw dataset, MICE imputed with RMSE of 7.83 dB [95% CI 7.81–7.86]. Imputing up to 6 missing features captures 99.3% of the audiograms in our dataset, allowing for a 5.7-fold increase in dataset size (1,304 to 7,399 audiograms) as compared with complete case analysis. Conclusion Precision medicine will inevitably play an integral role in the future of hearing healthcare. These methods are data dependent, and rigorously validated imputation models are a key tool for maximizing datasets. Using the largest CI audiogram dataset to-date, we demonstrate that in a real-world scenario MICE can safely impute missing data for the vast majority (>99%) of audiograms with RMSE well below a clinically significant threshold of 10dB. Evaluation across a range of dataset sizes and sparsity distributions suggests a high degree of generalizability to future applications.


Introduction
Cochlear implantation (CI) is considered one of modern medicine's highest achievements, capable of restoring hearing in patients with severe-to-profound hearing loss. As of December 2019, there have been over 736,000 registered devices implanted worldwide; in the United States, approximately 118,100 adults and 65,000 children have been implanted [1]. However, despite this success, big data research in CI patient care is essentially non-existent. Published studies typically range from single digits to 100-200 patients due to non-standardized protocols, missing data, and lack of collaboration [2,3]. This has major implications for outcomes research, population monitoring, and identification of potential CI candidates.
Data-centric statistical approaches are providing new opportunities to operationalize big data to improve patient care. In the field of CI research, machine learning models may offer novel insights into CI speech performance prognostication. However, missing data must first be appropriately dealt with [4]. Few studies directly address the development and validation of imputation methods, with potentially drastic implications for prediction model performance [4][5][6]. Missing data can bias results and lead to inefficient analyses though loss of precision and power [7,8]. Common methods for handling missing data include complete case analysis, missing indicators, and univariate imputation. While simple, these methods can introduce significant bias [5,6,9,10]. More sophisticated multivariate models use subjects' other known characteristics to increase imputation accuracy [5].
In addition to model selection, imputation accuracy depends on structural characteristics of data, such as feature intercorrelation, dataset size, amount of missing data, and distribution of sparsity [8]. However, patterns of missing audiometric data in patients undergoing CI evaluation have yet to be thoroughly characterized. Prior work on imputing missing audiometric data is also limited. Charih et al compared linear interpolation to k-nearest neighbors (KNN) for prediction of 3000Hz and 6000Hz with no difference in mean absolute error (MAE) found [11]. Reported MAE ranged from 5.38 dB to 7.36 dB depending on the model and frequency output. However, all atypical audiograms were algorithmically excluded from analysis using a gaussian mixture model, potentially inflating performance. Pitathawatchai et al compared models for prediction of certain frequencies in 206 pediatric audiograms using specific input frequencies [12]. They found that machine learning models such as neural networks and KNN were in aggregate (MAE 5-8 dB) superior to linear interpolation (MAE 6.25-10 dB). However, the small sample size and homogenous pediatric population limits generalizability.
Notably, all prior studies have assessed prediction of specific frequency outputs using specific frequencies as input. No study has attempted to simulate real-world patterns of missing data for imputation model assessment. This presents a significant barrier to implementation, as an understanding of real-world model performance is a critical prerequisite for use in clinical research. CI candidates also represent a potentially more challenging subset of patients, as varying degrees and etiologies of hearing loss can lead to more variance in audiometric thresholds as compared with the general population. While of particular interest, to date no study has looked at imputing missing data in this specific population. Finally, the field of machine learning, particularly in the context of medical data, suffers from a lack of standardization of methods to rigorously validate models. If imputation is to be employed to leverage larger datasets for CI research, proper validation of imputation techniques is critical to ensure results are generalizable, clinically relevant, and statistically sound.
The contributions of the study are as follows: 1. Proposal of a model for imputation model validation on randomly generated sparse datasets with structural sparsity equivalent to real-world data. Publication of code to facilitate validation of real-world imputation model performance on novel datasets.
2. Assessment of the independent effects of sparsity distribution, amount of missing data, and dataset size on imputation model performance.
3. Demonstration that, for this dataset, up to 6 missing frequencies per 11-frequency audiogram can be safely imputed, allowing for a 5.7-fold increase in sample size.
4. Demonstration that, depending on dataset size and sparsity, either interpolation or MICE is optimal for audiometric imputation; complex, black-box machine learning models are unnecessary.
5. We make available the largest CI audiometric dataset to-date, which may be used for outcomes modeling and to bolster imputation accuracy for other audiometric datasets. Invalid values (e.g., text or extreme numbers) were removed. Audiogram values were clipped to a range of 0 dB to 120 dB to account for inter-site variation in the reporting of extreme values. "No response" thresholds were imputed as 120 dB, the maximum value recorded by most audiometers. Removal of blank instances yielded a sparse "parent" dataset of 7,451 audiograms, used to calculate the distribution and degree of sparsity in real-world CI audiometric data. Complete case analysis yielded a dense dataset of 1,304 audiograms for model assessment (Fig 1). The sparse parent dataset with associated demographic data is available in supplementary files.

Model assessment
Models were assessed with 10 simulations of nested 10-fold by 10-fold cross-validation (Fig 2). 10-fold cross validation has been shown to represent a good trade-off between bias, variance, and computational cost [14]. Instances were randomly shuffled prior to each simulation such that fold partitions varied from run to run, as reporting the average and significance of repeated runs of k-fold cross-validation has been shown to significantly decrease variance of results [15]. Additionally, repeating runs helps account for the stochasticity of randomly generated sparse datasets. Nested cross-validation allows for model assessment across the entire dataset while avoiding data leakage from hyperparameter optimization, leading to more accurate performance estimates [16].

PLOS ONE
For each fold, the parent dataset (n = 7,451) was divided into training (90%) and test (10%) subsets, with test subsets sampled without replacement across folds. Sparsity distribution was calculated independently for each subset; incomplete instances were subsequently removed to create dense subsets. This was done following the split into train and test subsets to avoid potential data leakage. Sparse datasets with known underlying values were created from dense subsets by randomly deleting values from each instance. When randomly generating sparse datasets, two axes of sparsity were independently controlled for: quantity (how many features are missing per instance) and distribution (which features are relatively more likely to be missing). Four sparsity distributions were tested (Fig 3). Tunable hyperparameters were optimized on the randomly generated sparse train subset with 10-fold crossvalidation (S1 Table). The optimized model was used to predict imputations for the sparse test subset. Simulation performance metrics were measured across all 10 test subsets in aggregate. The mean and confidence interval of model performance across all simulations was reported as output.

Metrics
Coefficient of determination (R 2 score) describes the proportion of variance explained by the model (Eq 1). R 2 score is useful for comparing performance across dependent variables with varying degrees of noise. Mean absolute error (MAE) is the average absolute magnitude of residual error (Eq 2). While directly interpretable, MAE is robust to outliers, potentially masking critically high individual errors. Root mean squared error (RMSE) is the square root of the sum of squared residual errors (Eq 3). The squared error term more heavily penalizes higher individual errors, giving a more conservative estimate of performance.
ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi

Error threshold
We defined a clinically meaningful change in an audiometric threshold as �10dB, based on two factors. First, the established intra-session test-retest reliability of routine audiometry is +/-5dB [17,18]. Second, the American Speech-Language-Hearing Association (ASHA) defines a clinically significant hearing change for ototoxicity monitoring as 10dB across two frequencies or 20dB in one frequency [19]. The test-retest reliability of +/-5dB introduces intrinsic noise, representing a theoretical lower bound for model MAE. Assuming the +/-5dB of noise holds true for this dataset, model performance significantly below this threshold would indicate overfitting or data leakage. The 10dB clinically significant threshold is used as an upper bound to determine the amount of missing data that can be safely imputed while avoiding clinically significant bias. RMSE is used to define the upper bound as it is more conservative.

Imputation algorithms
Six models were evaluated: univariate imputation (UI), interpolation (INT), k-nearest neighbors (KNN), multiple imputation by chained equations (MICE), neural networks (NN), and gradient boosted trees (XGB). INT was implemented with the SciPy scientific computing library version 1.7 [20]. UI, KNN, MICE, and NN models were implemented with the Scikitlearn machine learning library version 1.0.2 [21]. XGB was implemented with the XGBoost Python API version 1.6.1 [22]. All testing was performed using the Python programming language version 3.9.12 (Python Software Foundation).

Univariate imputation
UI is a single-round imputation method that imputes values for a feature using only that same feature dimension. UI is a common alternative to complete case analysis, representing an important benchmark for the performance of more complex models. The UI hyperparameter search space included the imputation strategy: feature mean, median, and mode (S1 Table).

Interpolation
INT imputes missing data using non-missing data from that same instance only (S1 Fig).
Interpolants defined between adjacent non-missing datapoints are used to impute interposed missing datapoints. Missing datapoints outside the range of non-missing datapoints are imputed by extrapolating the nearest interpolant. Linear interpolation defines interpolants as first-degree linear equations between two known coordinates (x 0, y 0 ) and (x 2, y 2 ). For x 1 in the interval (x 0 , x 2 ), y 1 is defined by Eq 4. Cubic spline and piecewise cubic hermite interpolating polynomial (PCHIP) generate piecewise cubic polynomials as interpolants (Eq 5). Cubic spline interpolants have continuous first and second derivatives. PCHIP interpolants are continuous in the first derivative only, avoiding overshooting and preserving monotonicity. The INT hyperparameter search space included the interpolation method: linear, PCHIP, and cubic spline (S1 Table).

K-nearest neighbors
KNN is a single-round imputation method which imputes missing values using the k instances most similar to the instance being imputed. For a sparse instance x s , distance from an instance x i is calculated as the average distance in Euclidean space for all shared non-missing features.
The nearest neighbors are defined as the k instances with the lowest averaged distances from x s . Missing features in x s are imputed using the unweighted or distance-weighted means of the k-nearest neighbors (Eq 6). The unweighted model assigns an identical weight w ¼ 1 k to each k-nearest neighbor, such the relative importance of each nearest neighbor is equivalent. The weighted model assigns a weight to the ith nearest neighbor inversely proportional to the distance between x i and x s (Eqs 7 and 8). The KNN hyperparameter search space included the number of neighbors and the weighting function (S1 Table).
x s = Sparse instance to be imputed x i = ith nearest neighbor to x s w i = weight of ith instance k = number of nearest neighbors d(x s , x i ) = average feature distance between x s and x i

Gradient boosted trees
Decision trees are visualizable as a flowchart of yes or no decisions. While decision trees tend to overfit and are often poorly generalizable, ensembles are more robust [23]. Gradient boosted trees are a type of ensemble wherein individual decision trees are iteratively added to the model such that each new tree reduces the loss function, accounting for the error of the existing ensemble. We used the gradient boosted trees model XGBoost (XGB), described in Tianqi et al, 2016 [22]. As XGBoost outputs a single prediction, separate models were trained for each feature using the Scikit-learn multioutput regressor [24]. The XGB hyperparameter search space included the number of estimators, maximum tree depth, learning rate, and subsample proportion (S1 Table).

Neural network
A feed-forward multilayer perceptron was modeled after the NN described in Pitathawatchai et al (Adam optimizer, rectified linear unit activation function), shown to be superior to the common approach for audiometric frequency prediction in pediatric audiograms [12,25]. As NNs are intolerant of missing inputs, missing values were denoted with -1. The NN hyperparameter search space included the number of hidden layers, the number of nodes in each hidden layer, and the learning rate (S1 Table).

Multiple imputation by chained equations
Multiple imputation by chained equations is a well-validated method for missing data imputation [26]. Initially, all missing values are imputed using single imputation. Subsequently, all originally missing values are iteratively re-imputed by a series of multivariate regression models that impute values in a given feature using all other features as inputs. We tested l1 and l2 regularized linear regression models as the estimator (Eqs 9 and 10), as both are appropriate for data with high degrees of multicollinearity. Values are imputed for each feature in a roundrobin fashion; imputation of every feature once represents a single round of imputation. This process iteratively repeats for m rounds, generating m imputations for each originally missing value. The final round of imputation, presumed to be the most accurate, is commonly used as model output. Alternatively, Single Center Imputation from Multiple Chained Equations (SICE) defines model output as the mean of all imputations [27]. The MICE hyperparameter search space included the estimator, number of iterations, initial imputation function, and output method (S1 Table).

Quantity and distribution of sparsity
The amount of missing data significantly affects imputation accuracy. However, there is no standard for the amount of missing data that can be safely imputed. To address this, we first

PLOS ONE
defined a clinically significant threshold for "safe" audiometric imputation (< 10dB RMSE). Model performance was then assessed with increasing amounts of missing data to quantify the amount of missing data that can be safely imputed. Another important consideration is the distribution of missing data. One common approach for comparing imputation algorithms is to take a MCAR assumption and randomly introduce missing values [28]. However, this does not reflect real-world audiometric data, as certain frequencies are significantly more commonly tested. To assess real-world performance, we tested randomly generated sparse datasets with sparsity distributions weighted to match that of our parent dataset. Three other sparsity distributions (random, terminal, central) were tested to assess potential generalizability to novel datasets with different sparsity distributions (Fig 3). As expected, model performance degraded with increasing amounts of missing data (Fig 5). With 3 to 8 missing features, all models (except UI) performed significantly better with a realworld sparsity distribution compared to random and weighted-random sparsity distributions (Δ RMSE 0.3 dB-5.8 dB, non-overlapping 99% confidence intervals). With a real-world sparsity distribution, it is possible to safely impute up to 6 missing datapoints per audiogram using MICE, KNN, XGB, and NN. For all other sparsity distributions, the maximum for safe imputation ranged from 2 to 5 missing features, depending on model (excluding UI) and distribution ( Fig 5). This highlights the importance of considering sparsity distribution when evaluating imputation models, particularly with audiometric data.
Audiometric data is highly multicollinear. Adjacent frequencies are most strongly correlated, and correlation strength falls off dramatically with increasing distance on the frequency spectrum (Fig 6). For example, the Pearson correlation coefficient between 125 Hz and 250 Hz is 0.92; comparatively, the correlation between 125 Hz and 3000 to 8000 Hz ranges from 0.21 to 0.22. We hypothesize that this is likely a driving factor behind the favorability of imputing missing audiometric data with a real-world sparsity distribution. Random and weighted-random sparsity distributions are statistically more likely to have multiple sequential missing frequencies, compared to the real-world distribution of sparsity which is strongly weighted towards alternating present (octave) and absent (inter-octave) frequencies (Figs 3 and 4D).

Model selection
Sparsity of real-world data varies between instances both spatially ( Fig 4D) and quantitatively ( Fig 4A). To identify the best-performing model in a real-world scenario, models were tested on randomly generated sparse datasets with real-world spatial and quantitative distributions of sparsity, such that both the number and specific distribution of missing features varied between audiograms. Quantitative sparsity was capped at six missing features per instance, as this represents the maximum for safe imputation given a real-world spatial distribution of sparsity (Fig 5). MICE was the best performing model across all error metrics (p < 0.01, Wilcoxon rank sum test) with RMSE of 7.83 dB [95% CI 7.81-7.86] ( Table 1). To assess generalizability of model selection to novel datasets, testing was repeated on randomly generated sparse datasets with other sparsity distributions (random, terminal, and central). As above, the number of missing features per instance was randomized on a per audiogram basis, capped at six but otherwise mirroring the quantitative sparsity of the parent subset. MICE was the best performing model across all error metrics (p < 0.01, Wilcoxon rank sum test) for each sparsity distribution tested (Table 1).
Prior literature states that machine learning models such as KNN and neural networks are optimal. These results directly conflict with this notion, indicating that simple iterative linear regression produces optimal results, and that complex machine learning methods are not just unnecessary but detrimental. Overall, audiometric data is highly multicollinear, making it an

PLOS ONE
excellent candidate for imputation. This is further bolstered by the favorable sparsity distribution of real-world audiometric data. Imputing missing data in audiograms with up to 6 missing datapoints captures 99.3% of the audiograms in our dataset, allowing a 5.7-fold increase in sample size (1,304 to 7,399 audiograms) while maintaining a clinically insignificant level of error. These results also suggest an avenue for increased clinical efficiency for audiologists, as an accurate full 11-frequency audiogram could hypothetically be obtained by measuring only 5 frequencies and imputing the rest.

Dataset size
Dataset size is another consideration, as certain methods (e.g., neural networks) are known to require more data for optimal performance. Models were tested on datasets ranging in size

PLOS ONE
from 20 to 1280 audiograms (Fig 7). Sparsity was capped at 6 missing datapoints per audiogram; otherwise, sparsity quantity and distribution were equivalent to the parent dataset. As expected, performance of UI and INT models was independent of dataset size; all other models displayed a positive correlation between dataset size and performance. For datasets smaller than~1e2, INT was the best performing model. In all other cases, MICE was the best performing model. Notably, MICE performance plateaus at a dataset size of approximately 300-400 audiograms. In contrast, the more sophisticated machine learning models (KNN, XGB, and NN) continue to improve with increasing dataset size. This indicates the possibility that, with a significantly larger dataset, another model could outperform MICE. However, the performance of MICE on the entire dataset with up to 6 missing features is MAE 5.08 [95% CI 5.07-5.08] ( Table 1). It is not possible to impute with greater accuracy than the stochastic noise (test-retest error) of the underlying dataset, which prior studies define as roughly +/-5dB [17,18].
As such, MICE is close to the theoretical upper bound for model performance. For improvement with a more complex model to be significant enough to justify the cost of interpretability and ease of implementation, the literature-defined test-retest error would need to be an overestimate. However, a better model could potentially safely impute audiograms Table 1. Model selection. Model performance assessment given different sparsity distributions. Quantity of missing data varied on a per-instance basis, capped at 6 missing features but otherwise statistically equivalent to parent subset. Results averaged across 10 simulations, reported as mean (95% confidence interval).

PLOS ONE
with greater than 6 missing datapoints. While this represents only 0.7% of audiograms in our dataset, this could be applicable for audiometric screening, potentially increasing audiologists' clinical efficiency by decreasing the number of frequencies needed for a full audiogram.

Limitations and future directions
Generalizability is a primary concern for implementation of any predictive model. While our dataset is well-represented, future datasets may have different structural sparsity as practices for audiometric testing change with time. As such, blind implementation is not recommended. An analysis of sparsity and model performance is critical to avoid inadvertently introducing bias. The attached GitHub repository contains code and instructions such that researchers may easily perform these analyses on novel datasets. Audiometric testing practices also differ between institutions. Complete case analysis was used prior to analysis, potentially introducing selection bias. As institution-specific data is unavailable, the degree of selection bias could not be assessed. Furthermore, the study population was relatively homogenous; most patients were white or of unknown race with hearing loss predominantly of unknown etiology, all undergoing evaluation for cochlear implant. While results support imputation of other audiometric data, future work is needed to validate use in other populations. We hypothesize the decreased variance of audiometric thresholds in healthy patients would lead to increased performance.
Results were derived from imputation of full 11-frequency audiograms. However, investigators may decide to exclude certain frequencies. To assess generalizability of results with a restricted set of features, models were assessed for imputation of audiograms with 750 Hz and 1500 Hz (the two least commonly tested features) removed. This additionally allowed for assessment with a larger dataset, expanding sample size to 3,074. Results show that imputation with this feature set is comparatively more difficult. Given a real-world sparsity distribution, six non-missing datapoints were required for safe imputation, up from five non-missing datapoints with the 11-frequency dataset (S2 Fig). This suggests that a safer approach to the 11-frequency dataset would be capping quantitative sparsity at 5 rather than 6 missing features per instance, which still captures the majority (98.8%) of audiograms. Some may also disagree with a clinically significant threshold of �10dB RMSE. However, should investigators choose alternative definitions for clinically significant error, the threshold for error tolerance and the amount of missing data imputed may easily be adjusted.
Finally, models were optimized by grid-search. Certain models, such as XGB and NN, have many hyperparameters. A complete grid-search is not feasible, as grid-search size increases exponentially with additional parameters. As such, grid-search potentially underestimates performance of more complex models. To address this, we systematically identified all tunable hyperparameters prior to each test, such that 1) variance significantly (>1dB RMSE) affected model performance, and 2) there was no unilaterally superior value to serve as a default (S1 Table). Despite limitations, given the size of our dataset we believe the gain in generalizability from testing on all datapoints is worthwhile. Furthermore, our priority is avoiding performance overestimation, as a conservative understanding of performance is a key for incorporation of models into clinical research. Correspondingly, the parametric simplicity of INT, MICE, and KNN models represents an advantage over the more complex XGB and NN Model performance with varying dataset size. Models assessed on sparse datasets with sample size sequentially increasing twofold. Amount of sparsity varied on a per-instance basis from 1 to 6 missing features, mirroring the quantity of sparsity of the parent subset. Distribution of sparsity was real-world, mirroring parent subsets. Lines represent metric mean across 10 simulations; shaded bands represent 99% confidence intervals.
https://doi.org/10.1371/journal.pone.0281337.g007 models. However, future work is needed to refine the process of identifying tunable hyperparameters, address the computational cost associated with grid search, and assess this approach to hyperparameter tuning across a range of datasets. Similarly, while the importance of sparsity distribution is evident in audiometric data, further work is needed to assess the effects of sparsity distribution on other datasets.
This study validates imputation of a significant amount of CI audiometric data, addressing the prevalent issue of missing data in CI outcomes research. A next step is implementation of imputation models to increase sample size for CI outcomes modeling. Finally, future work should assess the degree to which audiograms can be used to safely impute other preoperative factors known to be associated with audiometry, such as hearing loss duration and etiology, age, and speech perception measurements.

Conclusions
Precision medicine has immense potential; however, these approaches are data-dependent and widespread sparsity in medical data represents a significant barrier to implementation. The current standard of complete case analysis leads to selection bias and decreased statistical power. The high degree of multicollinearity and favorable sparsity distribution of audiometric data makes it an excellent candidate for imputation. However, validation of imputation models is both necessary and often overlooked. We rigorously assessed the performance of six imputation models on CI candidate audiograms, demonstrating the importance of considering both the quantity and distribution of missing data. With a real-world sparsity distribution, up to 6 of 11 frequencies per audiogram can be safely imputed while maintaining RMSE below a clinically significant threshold of 10dB, allowing for a 5.7-fold increase in dataset size. The standard approach of complete case analysis leads to discarding of significant amounts of usable data. Univariate imputation, a common alternative to complete case analysis, introduces a significant amount of bias, is outperformed by all other models tested, and should not be used. Depending on dataset size and sparsity, optimal performance can be obtained with either interpolation or iterative linear regression. More sophisticated machine learning techniques, demonstrated by prior literature to be optimal, are unnecessary and detrimental. Results, derived using the largest CI dataset to date, are likely generalizable. However, the importance of validating imputation model performance on novel datasets prior to implementation cannot be understated.
Supporting information S1 Fig. Interpolation models. Visualization of interpolants computed using linear, PCHIP, and cubic spline interpolation methods for a sample audiogram with six non-missing datapoints. Interpolants bounded by range of underlying data (0dB to 120dB). (TIF) S2 Fig. 9-frequencies, rate and distribution analysis. Assessment of model performance on sparse datasets with different degrees of sparsity (1-8 of 9 features) and sparsity distributions (Real-world, Random, Terminal, Central). Colored lines denote mean root mean squared error; shaded bands represent 99% confidence intervals. (TIF) S1